Introduction to Statistics
Bennett Kleinberg
Week 1
- Why do we even need statistics?
- About the course
- Basic ideas
- Frequency distributions
Getting started
Another one
About Maria
Maria is 26 years old, single, outspoken, and very bright. She majored in law. As a student, she was deeply concerned with issues of discrimination and miscarriage of justice and participated in weekly animal-rights demonstrations.
Adapted from Tversky & Kahneman (1983)
Which is more probable?
- A: Maria works in a law firm
- B: Maria works in a law firm and does pro bono work for animal-rights activists
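The comparison can be settled with the conjunction rule of probability, whatever the description suggests. As a short worked step, write \(L\) for "works in a law firm" and \(R\) for "does pro bono work for animal-rights activists" (both symbols are introduced here for illustration). Option A is \(L\), option B is \(L \cap R\):

\[
P(L \cap R) = P(L)\,P(R \mid L) \le P(L), \quad \text{because } P(R \mid L) \le 1
\]

So option B can never be more probable than option A.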
Hollywood ruins books (does it?)
Good books become bad movies!
(demo)
Berkson’s paradox
Also holds for attractiveness and niceness in dating
Book tip: Jordan Ellenberg “How not to be wrong”
YT video from Numberphile
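A minimal simulation sketch of the selection effect behind Berkson's paradox. This is not the demo shown in the lecture. It assumes that book quality and movie quality are unrelated, and that we only notice a book/movie pair when at least one of the two is very good. The cut-off of 1, the sample size, and all names are illustrative assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical world: book quality and movie quality are unrelated.
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]

# Assumed selection rule: we only hear about a pair
# if the book or the movie is very good.
noticed = [(b, m) for b, m in pairs if b > 1 or m > 1]

r_all = pearson([b for b, _ in pairs], [m for _, m in pairs])
r_noticed = pearson([b for b, _ in noticed], [m for _, m in noticed])

print(f"all pairs:     r = {r_all:+.2f}")
print(f"noticed pairs: r = {r_noticed:+.2f}")
```

Among all pairs the correlation is close to zero, while among the noticed pairs it is clearly negative: the selection alone creates the impression that good books become bad movies.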
Why should I care?
- we are flooded with data
- we want to make sense of the world around us
- … esp. about human behaviour and society
Statistics is the best way to do this.
Suppose you wanted to know…
- whether loneliness increased during lockdown?
- how much more dangerous COVID-19 is for people with cancer?
- how engagement in online communities relates to extremist world views?
- whether a curfew increases rioting?
Statistics is not a good way to approach these questions.
It is the ONLY way to meaningfully approach these questions!
What does it even mean?
“Statistics, the science of collecting, analyzing, presenting, and interpreting data.” (Britannica)
“A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” (Merriam-Webster)
Note: this is \(\neq\) “statistics as a collection of data”
Synopsis of statistics
- we work with data in a numerical sense
- we want to obtain information from these data
- and we want to understand the uncertainty that comes with data
- this is one aspect where it differs from mathematical modelling
And lastly: the word data is the plural form of datum.
The data never lie!?
- people will use statistics to make their points
- this can be used to mislead
- you must become statistics-savvy to call bullsh*t
Nope: still not interested!
- social + behavioural sciences have embraced quantitative methods
- we seek to express processes/attributes/disorders as numbers
- so we also need methods to make sense of these numbers
The special role for Psychology
Measurement challenge
- Human behaviour and social processes are very complex
- Compare this to a drop of oil or the properties of gold
- We are often interested in the unobservables:
- intelligence
- well-being
- emotions (fear, sadness, …)
- loneliness
- These are very hard to measure!
- And we need methods to learn about humans in general (= the population)
This is the essence of inferential statistics.
Two stances towards statistics
- Statistics as a tool
- you use it to serve your purpose (e.g. making an inference based on your data)
- you have a pragmatic relationship with statistics (e.g. it’s needed to do research and to understand the world)
- Statistics as a discipline
- about improving statistics
- about better ways to model data, make inferences, quantify uncertainty
- esp. now: making sense of massive volumes of data (never use the term Big Data)
My promise
- Basic statistics today is what reading was yesterday
- If you invest the time to fully understand the content in this module (always ask if things are unclear), you will be fine
- Every more advanced approach builds on these basic ideas
If you are super-pragmatic: being able to do statistics pays well in industry.
The course: structure
- Lectures (14x)
- Seminars (4x)
- SPSS practicals (3x)
Lectures
- weekly video content
- weekly (live) in-depth session
- incl. Q&A
Seminars
- led by teaching assistants
- scheduled in B3W4, B3W8, B4W3, B4W6
- walk-through of exercises
SPSS seminars
- led by teaching assistants
- coordinator: Ghislaine van Bommel
- about implementing tests in SPSS
- first exposure to statistical software
Our expectation
| Activity | Count | Hours each | Total |
|---|---|---|---|
| Lectures | 14 | 2h | 28h |
| Seminars | 4 | 2h | 8h |
| SPSS labs | 3 | 2h | 6h |
| Weekly revision/self-study/preparation | 16 | 6h | 96h |
| Assessment: SPSS exam | 1 | 2h | 2h |
| Assessment: main exam | 1 | 3h | 3h |
| TOTAL | - | - | ~140h |
Our expectation
- prepare the lectures
- watch/attend the lectures and revise them
- make use of the seminars
- do the homework
Materials
- Statistics for the Behavioral Sciences (Gravetter & Wallnau)
- SPSS survival manual (Pallant)
The course: Piazza
- online Q&A platform
- when in doubt: always ask!
- we will answer questions and review your answers
- (watch the “introduction to Piazza” session)
SPSS test
- assesses your ability to perform analyses in SPSS
- all content from the book + practicals
- also tests the ability to interpret results
- computerised test
- Outcome: PASS/FAIL
Main exam
- multiple-choice questions (e.g. correct vs incorrect; 4 options)
- standard 1-10 grade scale
- needed: 5.5 (after guessing-level correction)
- date and form to be confirmed
Basic ideas in statistics
- The idea of “data”
- Types of statistical thinking
- First look at distributions
Approaches of statistics
Descriptive statistics
- about describing the data
- often through summary statistics (Week 2)
- e.g. on average, a Spanish woman is 1.63m tall
- e.g. The wealthiest 1% own 50% of the equity/shares in companies
Approaches of statistics
Inferential statistics
- we want to make an inference from something to something else
- here: we want to make an inference from the sample to the population
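The step from sample to population can be sketched with a small simulation. The population here is simulated and entirely hypothetical (heights with a made-up mean of 163 cm and standard deviation of 7 cm); in real research the population value is unknown and only the sample is observed.

```python
import random
import statistics

random.seed(42)  # fixed seed for reproducibility

# Hypothetical population: 100,000 simulated heights in cm
# (mean 163 and SD 7 are made-up illustration values).
population = [random.gauss(163, 7) for _ in range(100_000)]

# In practice we can only measure a sample, here 100 people.
sample = random.sample(population, k=100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean (usually unknown): {population_mean:.1f} cm")
print(f"sample mean (what we observe):     {sample_mean:.1f} cm")
```

The sample mean lands close to, but not exactly on, the population mean. Quantifying that gap is what inferential statistics is about.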
Inferential statistics
data \(\neq\) data
- Height (in cm)
- Annual income (in EUR)
- Smoker vs. non-smoker
- Pet (dog, cat, hamster, bunny)
- Support for Trump (from -5 to +5)
Dimensions of the “data” idea
- Constructs vs operationalisations
- Discrete vs continuous variables
- Different measurement levels
Constructs vs operationalisations
Discrete vs continuous variables
Some variables can only take separate, indivisible values or categories:
- e.g. gender, eye color, native language
- but also: no. of pets, no. of siblings, how often you were on holiday
There cannot be a value between 1 and 2 pets.
These variables are called discrete variables
Discrete vs continuous variables
Other variables can take all values between two points:
- e.g. income, height, weight, speed
- your height can, in principle, be expressed as 1.75123461736823837423 meters
- thus a reported value of a continuous variable (e.g. 1.75m) actually stands for an interval (here: roughly 1.745m to 1.755m)
Measuring variables
The nominal scale
- named categories (e.g., dog, cat, hamster)
- no quantitative distinction between them (you cannot say a dog is more than a cat)
- no zero!
Measuring variables
The ordinal scale
- ranked named categories (e.g., 1st, 2nd, 3rd)
- no equal distance between ranks
- no zero!
Measuring variables
The interval scale
- consists of equally-sized intervals between values
- each unit has the same size
- e.g. temperature:
- going from \(21^{\circ}C\) to \(26^{\circ}C\)
- going from \(1^{\circ}C\) to \(6^{\circ}C\)
- both have the same difference
- but: no real zero! (arbitrarily chosen)
Measuring variables
The ratio scale
- consists of equally-sized intervals between values
- each unit has the same size
- but now we do have an absolute zero
- e.g. distance: a distance of zero means your bike has not moved!
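A small sketch of why the missing true zero matters. It compares Celsius (interval scale) with Kelvin, which is introduced here only for illustration as a temperature scale with an absolute zero. The two temperatures are arbitrary example values.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0  # arbitrary example temperatures

# Differences are meaningful on both scales and agree.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios only make sense with a true zero: the Celsius "ratio" of 2.0
# is an artefact of where the zero point was placed.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {diff_c:.1f} °C vs {diff_k:.1f} K")
print(f"ratio:      {ratio_c:.2f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

Saying that 20°C is "twice as warm" as 10°C is therefore not meaningful, whereas 20 km really is twice as far as 10 km.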
Representing data
Today:
- data as a frequency distribution
- ways to represent data
- describing the location of datapoints
Example
How many pets do you have?
- we ask 10 people
- they state the number of pets that currently live in their household
Remember:
- the construct is “number of pets”
- the operationalisation is “the number of pets that currently live in a person’s main household”
Our data
| Person | No. of pets |
|---|---|
| 1 | 0 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 0 |
| 6 | 1 |
| 7 | 3 |
| 8 | 1 |
| 9 | 1 |
| 10 | 0 |
We may want some more structure
- maybe we can count how often each option occurs
- i.e. how many people have 0, 1, 2, … pets?
These are called the frequencies of values.
Frequencies of values
A structured table is then called a frequency distribution table.
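As a minimal sketch, the counting step can be done with Python's `collections.Counter` on the ten answers from the "Our data" slide (the variable names are illustrative):

```python
from collections import Counter

# The ten answers from the "Our data" slide (number of pets per person).
pets = [0, 2, 2, 3, 0, 1, 3, 1, 1, 0]

# Count how often each value occurs.
frequencies = Counter(pets)

print("pets  f")
for value in sorted(frequencies):
    print(f"{value:>4}  {frequencies[value]}")
```

Three people have no pets, three have one pet, two have two pets and two have three pets. The frequencies add up to the 10 people asked.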
Another example
- someone’s gender
- possible options: male - female - prefer-not-to-say
| Gender | Frequency |
|---|---|
| female | 55 |
| male | 38 |
| p-n-t-s | 7 |
Freq. distributions for continuous variables
| Person | Income |
|---|---|
| 31 | 37900 |
| 32 | 37300 |
| 33 | 17000 |
| 34 | 45300 |
| 35 | 25800 |
| 36 | 33600 |
| 37 | 89000 |
| 38 | 20200 |
| 39 | 57900 |
| 40 | 20700 |
Problem for a table?
| Income | Frequency |
|---|---|
| 20700 | 1 |
| 21300 | 2 |
| 22400 | 1 |
| 22800 | 1 |
| 22900 | 1 |
| 23700 | 1 |
| 25100 | 1 |
| 25800 | 1 |
| 26700 | 2 |
| 27900 | 1 |
Grouped frequency distributions
Idea:
- we bundle some value ranges together
- we can probably lose some measurement precision here
- example:
- low (0-25000)
- middle (25001-50000)
- upper-middle (50001-75000)
- high (75001+)
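A minimal sketch of the bundling step, using the four bands listed above and the ten incomes shown earlier for persons 31 to 40. The function name `income_group` is an illustrative assumption.

```python
from collections import Counter

def income_group(income):
    """Map an income to one of the four bands listed above."""
    if income <= 25_000:
        return "low"
    if income <= 50_000:
        return "middle"
    if income <= 75_000:
        return "upper-middle"
    return "high"

# The ten incomes shown earlier (persons 31 to 40).
incomes = [37900, 37300, 17000, 45300, 25800,
           33600, 89000, 20200, 57900, 20700]

grouped = Counter(income_group(x) for x in incomes)

for group in ["low", "middle", "upper-middle", "high"]:
    print(f"{group:<13} {grouped[group]}")
```

The ten distinct income values collapse into four rows, at the price of no longer knowing the exact income within each band.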
Grouped income data
| Income group | Frequency |
|---|---|
| low | 27 |
| middle | 24 |
| upper-middle | 19 |
| high | 30 |
Is this ideal?
What if we have these two data collections?
- no. of pets (\(n=10\))
- no. of pets (\(n=10000\))
What do we expect to see?
Comparing the tables
| No. of pets | Frequency |
|---|---|
| 0 | 2991 |
| 1 | 3057 |
| 2 | 2997 |
| 3 | 472 |
| 4 | 483 |
Solution: proportions
| No. of pets | Frequency | Proportion |
|---|---|---|
| 0 | 2991 | 0.2991 |
| 1 | 3057 | 0.3057 |
| 2 | 2997 | 0.2997 |
| 3 | 472 | 0.0472 |
| 4 | 483 | 0.0483 |
Proportion: \(p = \frac{f}{N}\)
… and percentages
| No. of pets | Frequency | Proportion | Percentage |
|---|---|---|---|
| 0 | 2991 | 0.2991 | 29.91 |
| 1 | 3057 | 0.3057 | 30.57 |
| 2 | 2997 | 0.2997 | 29.97 |
| 3 | 472 | 0.0472 | 4.72 |
| 4 | 483 | 0.0483 | 4.83 |
Percentage: \(p \times 100 = \frac{f}{N} \times 100\)
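A short sketch of both formulas applied to the \(n=10000\) pets table. The variable names are illustrative; `N` is the total number of observations, i.e. the sum of all frequencies.

```python
# Frequencies from the n = 10000 pets table.
frequencies = {0: 2991, 1: 3057, 2: 2997, 3: 472, 4: 483}

N = sum(frequencies.values())  # total number of observations

# Proportion: f / N, percentage: proportion * 100.
proportions = {value: f / N for value, f in frequencies.items()}
percentages = {value: p * 100 for value, p in proportions.items()}

for value, f in frequencies.items():
    print(f"{value}  f={f:>4}  p={proportions[value]:.4f}  %={percentages[value]:.2f}")
```

Because proportions and percentages are relative to `N`, tables based on 10 and on 10000 respondents become directly comparable.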
Visual representation

Histograms

Histograms (proportions)

Histograms comparison

Locating data points
- we may want to find where a value lies relative to the whole data
- e.g. Are 3 pets a lot or not?
- Where does an income of \(X=40,000\) lie in our data?
We can locate points based on the frequency distribution.
Percentiles
- We sort our frequency table
| No. of pets | Frequency | Proportion | Percentage |
|---|---|---|---|
| 0 | 2991 | 0.2991 | 29.91 |
| 1 | 3057 | 0.3057 | 30.57 |
| 2 | 2997 | 0.2997 | 29.97 |
| 3 | 472 | 0.0472 | 4.72 |
| 4 | 483 | 0.0483 | 4.83 |
Percentiles
- We sort our frequency table
- We calculate a cumulative percentage (same for proportions)
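| No. of pets | Frequency | Proportion | Percentage | Cumulative percentage |
|---|---|---|---|---|
| 0 | 2991 | 0.2991 | 29.91 | 29.91 |
| 1 | 3057 | 0.3057 | 30.57 | 60.48 |
| 2 | 2997 | 0.2997 | 29.97 | 90.45 |
| 3 | 472 | 0.0472 | 4.72 | 95.17 |
| 4 | 483 | 0.0483 | 4.83 | 100.00 |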
Percentiles
- We sort our frequency table
- We calculate a cumulative percentage (same for proportions)
- We locate our data point of interest (here: having 3 pets)
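| No. of pets | Frequency | Proportion | Percentage | Cumulative percentage |
|---|---|---|---|---|
| 0 | 2991 | 0.2991 | 29.91 | 29.91 |
| 1 | 3057 | 0.3057 | 30.57 | 60.48 |
| 2 | 2997 | 0.2997 | 29.97 | 90.45 |
| 3 | 472 | 0.0472 | 4.72 | 95.17 |
| 4 | 483 | 0.0483 | 4.83 | 100.00 |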
Interpreting percentiles
- We know that 3 pets corresponds to a cumulative percentage of 95.17%
- i.e. 95.17% of our data has been accumulated once we reach 3 pets (inclusive)
- 95.17% of responses are covered by 0, 1, 2, or 3 pets.
“3 pets” has a percentile rank of 95.17%
“3 pets” is the 95th percentile
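The three steps (sort, accumulate, look up) can be sketched in a few lines of Python, using the frequencies from the pets table. The variable names are illustrative.

```python
from itertools import accumulate

# Frequencies from the n = 10000 pets table.
frequencies = {0: 2991, 1: 3057, 2: 2997, 3: 472, 4: 483}
N = sum(frequencies.values())

# Step 1: sort the values.
values = sorted(frequencies)

# Step 2: running total of the frequencies, expressed as a percentage of N.
cumulative_f = list(accumulate(frequencies[v] for v in values))
cumulative_pct = {v: cf / N * 100 for v, cf in zip(values, cumulative_f)}

for v in values:
    print(f"{v}  {cumulative_pct[v]:6.2f}")

# Step 3: look up the data point of interest (here: having 3 pets).
percentile_rank_3_pets = cumulative_pct[3]
print(f"percentile rank of 3 pets: {percentile_rank_3_pets:.2f}%")
```

The last row always reaches 100%, which is a quick check that the accumulation is correct.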
Income data
| Income | Frequency | Percentage | Cumulative percentage |
|---|---|---|---|
| 800 | 1 | 1.0526 | 1.0526 |
| 1100 | 1 | 1.0526 | 2.1052 |
| 1500 | 1 | 1.0526 | 3.1578 |
| 4700 | 1 | 1.0526 | 4.2104 |
| 5700 | 1 | 1.0526 | 5.2630 |
| 9200 | 1 | 1.0526 | 6.3156 |
| 9300 | 1 | 1.0526 | 7.3682 |
| 10300 | 1 | 1.0526 | 8.4208 |
| 10400 | 1 | 1.0526 | 9.4734 |
| 11100 | 1 | 1.0526 | 10.5260 |
Obtaining percentiles
Where does an income of \(X=40000\) lie in our data?
| Income | Frequency | Percentage | Cumulative percentage |
|---|---|---|---|
| 37800 | 1 | 1.0526 | 46.3146 |
| 37900 | 1 | 1.0526 | 47.3672 |
| 38500 | 1 | 1.0526 | 48.4198 |
| 41900 | 1 | 1.0526 | 49.4724 |
| 43600 | 1 | 1.0526 | 50.5250 |
An income of 40,000 has a percentile rank of 48.42%.
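The value 40,000 does not occur in the data, so its percentile rank is the cumulative percentage of the largest observed income below it (38,500). A minimal sketch of that lookup follows. The helper `percentile_rank` and the eight incomes are hypothetical and are not the lecture's income data, so the resulting number differs from the one above.

```python
from bisect import bisect_right

def percentile_rank(sorted_values, x):
    """Percentage of observations that are less than or equal to x."""
    at_or_below = bisect_right(sorted_values, x)
    return at_or_below / len(sorted_values) * 100

# Hypothetical mini data set (NOT the lecture's income data), sorted ascending.
incomes = [20700, 25800, 33600, 37900, 38500, 41900, 43600, 57900]

print(percentile_rank(incomes, 40000))  # 5 of the 8 values are <= 40000
```

The list must be sorted for `bisect_right` to work, which mirrors the first step of the procedure: sort the frequency table.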
Recap
- intro to the module
- first steps
- frequency distributions
- locating data points
Next week
Understanding data further:
- central tendency of data
- variability of data